visual grounding
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)
SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion
Ming Dai, Lingfeng Yang
Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning.
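To make the contrast concrete, here is a minimal sketch of the two encoding styles the abstract distinguishes. This is not the SimVG implementation; the encoders are random stand-ins and all dimensions are made-up. "Late fusion" pools each modality independently and only then combines them, while "joint fusion" lets every word interact with every image patch before pooling:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(tokens, dim=8):
    """Stand-in text encoder: one fixed-size embedding per token (illustrative only)."""
    return rng.standard_normal((len(tokens), dim))

def encode_image(patches, dim=8):
    """Stand-in image encoder: one embedding per image patch (illustrative only)."""
    return rng.standard_normal((patches, dim))

def late_fusion(text_feats, img_feats):
    """'Independent' pipeline: pool each modality separately, then concatenate."""
    return np.concatenate([text_feats.mean(0), img_feats.mean(0)])

def joint_fusion(text_feats, img_feats):
    """Token-level interaction: every word attends over every patch (dot-product attention)."""
    attn = text_feats @ img_feats.T                           # (words, patches) similarity
    attn = np.exp(attn) / np.exp(attn).sum(1, keepdims=True)  # softmax over patches
    grounded = attn @ img_feats                               # image context per word
    return (text_feats + grounded).mean(0)                    # fused sentence-level feature

text = encode_text(["the", "red", "car"])
img = encode_image(patches=16)
print(late_fusion(text, img).shape)   # (16,) = 8 text dims + 8 image dims
print(joint_fusion(text, img).shape)  # (8,)
```

The difference the abstract points at is where the cross-modal interaction happens: hand-crafted interaction modules sit between the encoders and the head, whereas a decoupled design separates that interaction from the downstream reasoning.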
CityRefer Datasheet
We follow the guidelines of the datasheets for datasets [1] to explain the composition, collection, recommended use cases, and other details of the CityRefer dataset.

For what purpose was the dataset created? We created the CityRefer dataset to facilitate research toward city-scale 3D visual grounding.
Who created the dataset (e.g., which team, research group) and on behalf of which entity? Who funded the creation of the dataset?
What do the instances that comprise the dataset represent? CityRefer contains descriptions for 3D visual grounding on large-scale point cloud data.
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.68)
Exploiting Contextual Objects and Relations for 3D Visual Grounding
3D visual grounding is challenging because it requires capturing 3D contextual information to distinguish target objects within complex scenes, and the absence of annotations for contextual objects and relations further exacerbates the difficulty. In this paper, we propose a novel model, CORE-3DVG, that addresses these challenges by explicitly learning about contextual objects and relations. Our method performs 3D visual grounding via three sequential modular networks: a text-guided object detection network, a relation matching network, and a target identification network. During training, we introduce a pseudo-label self-generation strategy and a weakly supervised method to facilitate the learning of contextual objects and relations, respectively. These techniques allow the networks to focus more effectively on referred objects within 3D scenes by better understanding their context. We validate our model on the challenging Nr3D, Sr3D, and ScanRefer datasets and demonstrate state-of-the-art performance.
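The three-stage decomposition described above (detect candidates from the text, score them by their relations to context objects, then pick the target) can be sketched generically. This is a toy pipeline under assumed interfaces, not the CORE-3DVG architecture: boxes are bare centroids, only a "near" relation is scored, and all names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    label: str
    center: tuple  # (x, y, z) centroid, illustrative units

def detect_objects(scene, query_nouns):
    """Stage 1 (text-guided detection): keep only boxes whose label the query mentions."""
    return [b for b in scene if b.label in query_nouns]

def match_relation(candidates, anchors, relation):
    """Stage 2 (relation matching): score each candidate by a spatial relation to
    anchor (context) objects. Only 'near' is sketched, as negative distance to
    the closest anchor."""
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a.center, b.center)) ** 0.5
    assert relation == "near", "only 'near' is implemented in this sketch"
    return {id(c): -min(dist(c, a) for a in anchors) for c in candidates}

def identify_target(candidates, scores):
    """Stage 3 (target identification): return the best-scoring candidate."""
    return max(candidates, key=lambda c: scores[id(c)])

# Query: "the chair near the table"
scene = [Box3D("chair", (0, 0, 0)), Box3D("chair", (5, 0, 0)), Box3D("table", (4, 0, 0))]
cands = detect_objects(scene, {"chair"})
anchors = detect_objects(scene, {"table"})
scores = match_relation(cands, anchors, "near")
target = identify_target(cands, scores)
print(target.center)  # (5, 0, 0) — the chair closest to the table
```

The point of the sequential design is that ambiguity between same-class candidates (two chairs) is resolved only at the relation stage, which is exactly where contextual objects enter.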
Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding
The main question we address is: "Can we consolidate the 3D visual stream with 2D clues and efficiently utilize them in both the training and testing phases?" The main idea is to assist the 3D encoder by incorporating rich 2D object representations without requiring extra 2D inputs. To this end, we leverage 2D clues, synthetically generated from 3D point clouds, which empirically prove effective at boosting the quality of the learned visual representations.
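Synthesizing 2D clues from a 3D point cloud ultimately rests on standard camera projection: each 3D point in camera coordinates is mapped onto an image plane. A minimal pinhole-projection sketch, with made-up intrinsics (`f`, `cx`, `cy`) chosen purely for illustration:

```python
import numpy as np

def project_points(points, f=500.0, cx=320.0, cy=240.0):
    """Project 3D points (camera coordinates, z pointing forward) onto a pinhole
    image plane: u = f*x/z + cx, v = f*y/z + cy. Intrinsics are illustrative."""
    pts = np.asarray(points, dtype=float)
    z = pts[:, 2]
    u = f * pts[:, 0] / z + cx
    v = f * pts[:, 1] / z + cy
    return np.stack([u, v], axis=1)

# A point straight ahead lands at the principal point; lateral offsets scale with 1/z.
uv = project_points([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0]])
print(uv)  # [[320. 240.], [570. 240.]]
```

Rendering such projections from several synthetic viewpoints is one way to obtain 2D object crops from point clouds alone, which is the spirit of the "no extra 2D inputs" constraint above.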